{
 "cells": [
  {
   "cell_type": "raw",
   "id": "913dd5a2-24d1-4f8e-bc15-ab518483eef9",
   "metadata": {},
   "source": [
    "---\n",
    "title: Handle long text\n",
    "sidebar_position: 2\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e161a8a-fcf0-4d55-933e-da271ce28d7e",
   "metadata": {},
   "source": [
    "When working with files, like PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:\n",
    "\n",
    "1. **Change LLM** Choose a different LLM that supports a larger context window.\n",
    "2. **Brute Force** Chunk the document, and extract content from each chunk.\n",
    "3. **RAG** Chunk the document, index the chunks, and only extract content from a subset of chunks that look \"relevant\".\n",
    "\n",
    "Keep in mind that these strategies have different trade offs and the best strategy likely depends on the application that you're designing!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57969139-ad0a-487e-97d8-cb30e2af9742",
   "metadata": {},
   "source": [
    "## Set up\n",
    "\n",
    "First, let's install some required dependencies:\n",
    "\n",
    "```{=mdx}\n",
    "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n",
    "import Npm2Yarn from \"@theme/Npm2Yarn\";\n",
    "\n",
    "<IntegrationInstallTooltip></IntegrationInstallTooltip>\n",
    "\n",
    "<Npm2Yarn>\n",
    "  @langchain/openai zod cheerio\n",
    "</Npm2Yarn>\n",
    "```\n",
    "\n",
    "Next, we need some example data! Let's download an article about [cars from Wikipedia](https://en.wikipedia.org/wiki/Car) and load it as a LangChain `Document`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "571aad22-2cec-4b9b-b656-5e4b81a1ec6c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Module: null prototype] {\n",
       "  contains: \u001b[36m[Function: contains]\u001b[39m,\n",
       "  default: [Function: initialize] {\n",
       "    contains: \u001b[36m[Function: contains]\u001b[39m,\n",
       "    html: \u001b[36m[Function: html]\u001b[39m,\n",
       "    merge: \u001b[36m[Function: merge]\u001b[39m,\n",
       "    parseHTML: \u001b[36m[Function: parseHTML]\u001b[39m,\n",
       "    root: \u001b[36m[Function: root]\u001b[39m,\n",
       "    text: \u001b[36m[Function: text]\u001b[39m,\n",
       "    xml: \u001b[36m[Function: xml]\u001b[39m,\n",
       "    load: \u001b[36m[Function: load]\u001b[39m,\n",
       "    _root: Document {\n",
       "      parent: \u001b[1mnull\u001b[22m,\n",
       "      prev: \u001b[1mnull\u001b[22m,\n",
       "      next: \u001b[1mnull\u001b[22m,\n",
       "      startIndex: \u001b[1mnull\u001b[22m,\n",
       "      endIndex: \u001b[1mnull\u001b[22m,\n",
       "      children: [],\n",
       "      type: \u001b[32m\"root\"\u001b[39m\n",
       "    },\n",
       "    _options: { xml: \u001b[33mfalse\u001b[39m, decodeEntities: \u001b[33mtrue\u001b[39m },\n",
       "    fn: Cheerio {}\n",
       "  },\n",
       "  html: \u001b[36m[Function: html]\u001b[39m,\n",
       "  load: \u001b[36m[Function: load]\u001b[39m,\n",
       "  merge: \u001b[36m[Function: merge]\u001b[39m,\n",
       "  parseHTML: \u001b[36m[Function: parseHTML]\u001b[39m,\n",
       "  root: \u001b[36m[Function: root]\u001b[39m,\n",
       "  text: \u001b[36m[Function: text]\u001b[39m,\n",
       "  xml: \u001b[36m[Function: xml]\u001b[39m\n",
       "}"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import { CheerioWebBaseLoader } from \"langchain/document_loaders/web/cheerio\";\n",
    "// Only required in a Deno notebook environment to load the peer dep.\n",
    "import \"cheerio\";\n",
    "\n",
    "const loader = new CheerioWebBaseLoader(\n",
    "  \"https://en.wikipedia.org/wiki/Car\"\n",
    ");\n",
    "\n",
    "const docs = await loader.load();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "85656454-6d5d-4ff6-93ca-690791ac1ec4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\u001b[33m95865\u001b[39m"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[0].pageContent.length;"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af3ffb8d-587a-4370-886a-e56e617bcb9c",
   "metadata": {},
   "source": [
    "## Define the schema\n",
    "\n",
    "Here, we'll define schema to extract key developments from the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a3b288ed-87a6-4af0-aac8-20921dc370d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { z } from \"zod\";\n",
    "import { ChatPromptTemplate } from \"@langchain/core/prompts\";\n",
    "import { ChatOpenAI } from \"@langchain/openai\";\n",
    "\n",
    "const keyDevelopmentSchema = z.object({\n",
    "  year: z.number().describe(\"The year when there was an important historic development.\"),\n",
    "  description: z.string().describe(\"What happened in this year? What was the development?\"),\n",
    "  evidence: z.string().describe(\"Repeat verbatim the sentence(s) from which the year and description information were extracted\"),\n",
    "}).describe(\"Information about a development in the history of cars.\");\n",
    "\n",
    "const extractionDataSchema = z.object({\n",
    "  key_developments: z.array(keyDevelopmentSchema),\n",
    "}).describe(\"Extracted information about key developments in the history of cars\");\n",
    "\n",
    "const SYSTEM_PROMPT_TEMPLATE = [\n",
    "  \"You are an expert at identifying key historic development in text.\",\n",
    "  \"Only extract important historic developments. Extract nothing if no important information can be found in the text.\"\n",
    "].join(\"\\n\");\n",
    "\n",
    "// Define a custom prompt to provide instructions and any additional context.\n",
    "// 1) You can add examples into the prompt template to improve extraction quality\n",
    "// 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
    "//    about the document from which the text was extracted.)\n",
    "const prompt = ChatPromptTemplate.fromMessages([\n",
    "  [\n",
    "    \"system\",\n",
    "    SYSTEM_PROMPT_TEMPLATE,\n",
    "  ],\n",
    "  // Keep on reading through this use case to see how to use examples to improve performance\n",
    "  // MessagesPlaceholder('examples'),\n",
    "  [\"human\", \"{text}\"],\n",
    "]);\n",
    "\n",
    "// We will be using tool calling mode, which\n",
    "// requires a tool calling capable model.\n",
    "const llm = new ChatOpenAI({\n",
    "  modelName: \"gpt-4-0125-preview\",\n",
    "  temperature: 0,\n",
    "});\n",
    "\n",
    "const extractionChain = prompt.pipe(llm.withStructuredOutput(extractionDataSchema));"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13aebafb-26b5-42b2-ae8e-9c05cd56e5c5",
   "metadata": {},
   "source": [
    "## Brute force approach\n",
    "\n",
    "Split the documents into chunks such that each chunk fits into the context window of the LLMs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "27b8a373-14b3-45ea-8bf5-9749122ad927",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { TokenTextSplitter } from \"langchain/text_splitter\";\n",
    "\n",
    "const textSplitter = new TokenTextSplitter({\n",
    "  chunkSize: 2000,\n",
    "  chunkOverlap: 20,\n",
    "});\n",
    "\n",
    "// Note that this method takes an array of docs\n",
    "const splitDocs = await textSplitter.splitDocuments(docs);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b43d7e0-3c85-4d97-86c7-e8c984b60b0a",
   "metadata": {},
   "source": [
    "Use the `.batch` method present on all runnables to run the extraction in **parallel** across each chunk! \n",
    "\n",
    ":::{.callout-tip}\n",
    "You can often use `.batch()` to parallelize the extractions!\n",
    "\n",
    "If your model is exposed via an API, this will likely speed up your extraction flow.\n",
    ":::"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6ba766b5-8d6c-48e6-8d69-f391a66b65d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "// Limit just to the first 3 chunks\n",
    "// so the code can be re-run quickly\n",
    "const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);\n",
    "\n",
    "const extractionChainParams = firstFewTexts.map((text) => {\n",
    "  return { text };\n",
    "});\n",
    "\n",
    "const results = await extractionChain.batch(extractionChainParams, { maxConcurrency: 5 });"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67da8904-e927-406b-a439-2a16f6087ccf",
   "metadata": {},
   "source": [
    "### Merge results\n",
    "\n",
    "After extracting data from across the chunks, we'll want to merge the extractions together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "30b35897-4d94-44ad-80c6-446eff61b76b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\n",
       "  { year: \u001b[33m0\u001b[39m, description: \u001b[32m\"\"\u001b[39m, evidence: \u001b[32m\"\"\u001b[39m },\n",
       "  {\n",
       "    year: \u001b[33m1769\u001b[39m,\n",
       "    description: \u001b[32m\"French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.\"\u001b[39m,\n",
       "    evidence: \u001b[32m\"French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.\"\u001b[39m\n",
       "  },\n",
       "  {\n",
       "    year: \u001b[33m1808\u001b[39m,\n",
       "    description: \u001b[32m\"French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu\"\u001b[39m... 25 more characters,\n",
       "    evidence: \u001b[32m\"French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu\"\u001b[39m... 33 more characters\n",
       "  },\n",
       "  {\n",
       "    year: \u001b[33m1886\u001b[39m,\n",
       "    description: \u001b[32m\"German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,\"\u001b[39m... 40 more characters,\n",
       "    evidence: \u001b[32m\"The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German\"\u001b[39m... 56 more characters\n",
       "  },\n",
       "  {\n",
       "    year: \u001b[33m1908\u001b[39m,\n",
       "    description: \u001b[32m\"The 1908 Model T, an American car manufactured by the Ford Motor Company, became one of the first ca\"\u001b[39m... 28 more characters,\n",
       "    evidence: \u001b[32m\"One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by\"\u001b[39m... 24 more characters\n",
       "  }\n",
       "]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "const keyDevelopments = results.flatMap((result) => result.key_developments);\n",
    "\n",
    "keyDevelopments.slice(0, 20);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48afd4a7-abcd-48b4-8ff1-6ca485f529e3",
   "metadata": {},
   "source": [
    "## RAG based approach\n",
    "\n",
    "Another simple idea is to chunk up the text, but instead of extracting information from every chunk, just focus on the the most relevant chunks.\n",
    "\n",
    ":::{.callout-caution}\n",
    "It can be difficult to identify which chunks are relevant.\n",
    "\n",
    "For example, in the `car` article we're using here, most of the article contains key development information. So by using\n",
    "**RAG**, we'll likely be throwing out a lot of relevant information.\n",
    "\n",
    "We suggest experimenting with your use case and determining whether this approach works or not.\n",
    ":::\n",
    "\n",
    "Here's a simple example that relies on an in-memory demo `MemoryVectorStore` vectorstore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "aaf37c82-625b-4fa1-8e88-73303f08ac16",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { MemoryVectorStore } from \"langchain/vectorstores/memory\";\n",
    "import { OpenAIEmbeddings } from \"@langchain/openai\";\n",
    "\n",
    "// Only load the first 10 docs for speed in this demo use-case\n",
    "const vectorstore = await MemoryVectorStore.fromDocuments(\n",
    "  splitDocs.slice(0, 10),\n",
    "  new OpenAIEmbeddings()\n",
    ");\n",
    "\n",
    "// Only extract from top document\n",
    "const retriever = vectorstore.asRetriever({ k: 1 });"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "013ecad9-f80f-477c-b954-494b46a02a07",
   "metadata": {},
   "source": [
    "In this case the RAG extractor is only looking at the top document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "47aad00b-7013-4f7f-a1b0-02ef269093bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { RunnableSequence } from \"@langchain/core/runnables\";\n",
    "\n",
    "const ragExtractor = RunnableSequence.from([\n",
    "  {\n",
    "    text: retriever.pipe((docs) => docs[0].pageContent)\n",
    "  },\n",
    "  extractionChain,\n",
    "]);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "68f2de01-0cd8-456e-a959-db236189d41b",
   "metadata": {},
   "outputs": [],
   "source": [
    "const results = await ragExtractor.invoke(\"Key developments associated with cars\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "56f434ea-1869-4192-914e-3ccf64e72f75",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\n",
       "  {\n",
       "    year: \u001b[33m2020\u001b[39m,\n",
       "    description: \u001b[32m\"The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1.\"\u001b[39m... 33 more characters,\n",
       "    evidence: \u001b[32m\"The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2\"\u001b[39m... 31 more characters\n",
       "  },\n",
       "  {\n",
       "    year: \u001b[33m2030\u001b[39m,\n",
       "    description: \u001b[32m\"All fossil fuel vehicles will be banned in Amsterdam from 2030.\"\u001b[39m,\n",
       "    evidence: \u001b[32m\"all fossil fuel vehicles will be banned in Amsterdam from 2030.\"\u001b[39m\n",
       "  },\n",
       "  {\n",
       "    year: \u001b[33m2020\u001b[39m,\n",
       "    description: \u001b[32m\"In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.\"\u001b[39m,\n",
       "    evidence: \u001b[32m\"In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.\"\u001b[39m\n",
       "  }\n",
       "]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results.key_developments;"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf36e626-cf5d-4324-ba29-9bd602be9b97",
   "metadata": {},
   "source": [
    "## Common issues\n",
    "\n",
    "Different methods have their own pros and cons related to cost, speed, and accuracy.\n",
    "\n",
    "Watch out for these issues:\n",
    "\n",
    "* Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.\n",
    "* Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!\n",
    "* LLMs can make up data. If looking for a single fact across a large text and using a brute force approach, you may end up getting more made up data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Deno",
   "language": "typescript",
   "name": "deno"
  },
  "language_info": {
   "file_extension": ".ts",
   "mimetype": "text/x.typescript",
   "name": "typescript",
   "nb_converter": "script",
   "pygments_lexer": "typescript",
   "version": "5.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
